Towards the end of 2023 I applied for and was accepted into the MSc Data Science programme at the University of Cape Town. It was this new journey that inspired me to embark on a little side project of my own: a task that would involve every stage of the data analysis pipeline, from data collection, to data cleaning, to visualising the data and finally using it to tell a story. I also envisioned it as an opportunity to put into practice what I would learn throughout the year as part of my MSc.
I chose to focus my project on analysing the times I clocked in and out at the office. I figured this would be interesting to analyse, especially considering that, as part of my MSc, I would have lectures that would sometimes eat into my usual working hours. That, coupled with the other commitments that come up in life, sounded like it would make for an interesting analysis, and that's how this project was born.
This notebook demonstrates how I went about cleaning, analysing and plotting the data. A more detailed post will follow on my blog (blog.tinomuzambi.com), delving into the data collection process and following a more storytelling narrative.
Read in data
The data come from a Notion database as a CSV and contain fields for the date of interest, the times I clocked in and out on said date, and any notes detailing anything of interest that occurred on said date, e.g. a lecture, a pre-work commitment or a conference.
We begin by reading in the data while preventing automatic casting, as this leads to the time fields being cast to a difftime object, which isn't conducive for our analysis. We therefore use the col_types argument to retain all data as character vectors. We also convert the data to a tibble for the nice qualities that tibbles provide.[1]
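A minimal sketch of the read step. The notebook itself reads "data/data.csv"; here an inline CSV string stands in for that export so the example is self-contained, with the same col_types = "cccc" argument doing the work:

```r
library(tidyverse)

# Inline stand-in for the real "data/data.csv" Notion export
csv <- I('Date,"Clock In Time","Clock Out Time",Notes
"Monday, 8 January 2024",09:02,16:49,
"Friday, 12 January 2024",,,Work from home')

# col_types = "cccc" keeps all four columns as character vectors,
# preventing the time fields from becoming difftime objects
data <- read_csv(csv, col_types = "cccc") %>% as_tibble()
data
```

The notebook's actual call is identical except that the first argument is the path to the CSV file.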
The data contain 250 entries, where each entry is a day and all fields are stored as character vectors. The notes field is a multi-select field, and its options are a comma-separated character vector of any number of the following:
Work from home
Pre-work commitment
Post-work commitment
Annual leave
Sick leave
Study leave
Public holiday
Conference
11:00 lecture
14:00 lecture
16:00 lecture
Days with any of work from home, annual leave, sick leave, public holiday or conference will likely not have clock in/out times, since I won't go into the office on those occasions.
Prepare data for analysis
To prepare the data for analysis, a few cleaning steps are necessary. We begin by renaming the columns to make them easier to work with and more representative of their contents.
# Rename columns
data <- data %>%
  rename(
    date = Date,
    clock.in = "Clock In Time",
    clock.out = "Clock Out Time",
    event = Notes
  )

data %>% head
# A tibble: 6 × 4
date clock.in clock.out event
<chr> <chr> <chr> <chr>
1 Monday, 8 January 2024 09:02 16:49 <NA>
2 Tuesday, 9 January 2024 09:22 16:36 <NA>
3 Wednesday, 10 January 2024 09:21 14:25 <NA>
4 Thursday, 11 January 2024 09:19 16:31 <NA>
5 Friday, 12 January 2024 <NA> <NA> Work from home
6 Monday, 15 January 2024 08:45 16:51 <NA>
We also add an ID column to make identifying rows easier.
# Add id column
data <- data %>%
  mutate(id = as.integer(rownames(data)))

data %>% head
# A tibble: 6 × 5
date clock.in clock.out event id
<chr> <chr> <chr> <chr> <int>
1 Monday, 8 January 2024 09:02 16:49 <NA> 1
2 Tuesday, 9 January 2024 09:22 16:36 <NA> 2
3 Wednesday, 10 January 2024 09:21 14:25 <NA> 3
4 Thursday, 11 January 2024 09:19 16:31 <NA> 4
5 Friday, 12 January 2024 <NA> <NA> Work from home 5
6 Monday, 15 January 2024 08:45 16:51 <NA> 6
We then manually cast fields to appropriate data types. Here we take advantage of Lubridate's excellent datetime casting capabilities: we use the parse_date_time function to cast both the date and the clock in/out time fields, demonstrating the versatility of this function and this library. This casts the fields to the POSIXct class, which lets us perform various datetime operations on them. However, here we run into a characteristic of R that's less than ideal.
When working with dates and times in R, the aforementioned POSIXct class is used, and this class mandates having both a date portion and a time portion. This means that date fields are required to have some arbitrary time portion and, similarly, time fields are required to have some arbitrary date portion. This is why you'll often see time fields beginning with a filler date such as 1970-01-01, the Unix epoch that R uses as a reference point for measuring time in seconds[2] (Lubridate, as we'll see below, uses 0000-01-01 instead). This ultimately has no effect on functionality, but it is a nuisance when it comes to display, as it requires reformatting to exclude the unwanted portion.
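A quick illustration of the filler date, using the same parse_date_time call as below:

```r
library(lubridate)

# Parsing a bare time still yields a full datetime with a filler date
t <- parse_date_time("09:02", orders = "%H:%M")
t                   # "0000-01-01 09:02:00 UTC"
format(t, "%H:%M")  # reformatting hides the unwanted date portion
```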
# Cast date, clock.in and clock.out using Lubridate
data <- data %>%
  mutate(
    date = parse_date_time(date, orders = "dmy"),
    clock.in = parse_date_time(clock.in, orders = "%H:%M"),
    clock.out = parse_date_time(clock.out, orders = "%H:%M"),
    # Convert events to programmatically friendly strings
    event = gsub(" ", "_", event),
    event = gsub(":", "_", event),
    event = gsub("-", "_", event),
    # Replace NA events
    event = replace(event, is.na(event), "Standard")
  )

data %>% head
# A tibble: 6 × 5
date clock.in clock.out event id
<dttm> <dttm> <dttm> <chr> <int>
1 2024-01-08 00:00:00 0000-01-01 09:02:00 0000-01-01 16:49:00 Standard 1
2 2024-01-09 00:00:00 0000-01-01 09:22:00 0000-01-01 16:36:00 Standard 2
3 2024-01-10 00:00:00 0000-01-01 09:21:00 0000-01-01 14:25:00 Standard 3
4 2024-01-11 00:00:00 0000-01-01 09:19:00 0000-01-01 16:31:00 Standard 4
5 2024-01-12 00:00:00 NA NA Work_from_h… 5
6 2024-01-15 00:00:00 0000-01-01 08:45:00 0000-01-01 16:51:00 Standard 6
Tidy data
With those basic steps complete, we now begin moving towards getting the data into a “tidy” format. Tidy data is a standard way of structuring data that makes it easier to analyse, visualise, and model. The key principle of tidy data is that each variable forms a column, each observation forms a row, and each type of observational unit forms a table.[3]
For our use case, tidying the data will mean dealing with the event column by extracting the different possible events into their own columns. We start by splitting the fields in the event column for cases where there were multiple events in one day.
# Separate multi-event days into one row per event
data.tidy <- data %>%
  separate_rows(event, sep = ",_")

data.tidy %>% filter(id == 24) %>% head
Next we create binary indicator columns for each type of event. This moves us closer to tidy data, with one column per event taking the value TRUE or FALSE depending on whether that event occurred on that day.
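The indicator step in the notebook source uses one if_else per event; sketched here for three of the eleven event types, on a toy stand-in for data.tidy:

```r
library(tidyverse)

# Toy stand-in for data.tidy after separate_rows()
data.tidy <- tibble(
  id    = c(1, 2, 2),
  event = c("Standard", "Pre_Work_Commitment", "16_00_lecture")
)

# One binary indicator column per event; the notebook repeats this
# pattern for all eleven event types
data.tidy <- data.tidy %>%
  mutate(
    event_Pre_Work_Commitment = if_else(event == "Pre_Work_Commitment", TRUE, FALSE),
    event_16_00_lecture       = if_else(event == "16_00_lecture", TRUE, FALSE),
    event_11_00_lecture       = if_else(event == "11_00_lecture", TRUE, FALSE)
  )
```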
We replace any NAs in the indicator columns with 0 because those values mean the same thing (that event did not occur on that day) and the data should clearly reflect that.
# Replace NAs with 0s in indicator columns
data.tidy <- data.tidy %>%
  mutate(across(starts_with("event_"), ~ replace(., is.na(.), 0)))

data.tidy %>% select(starts_with("event_")) %>% head
We then collapse the data back into the original number of rows, in keeping with the requirements of tidy data. We do this by grouping by the ID and applying the max aggregate to the event columns, which means we lose the other columns. We remedy this by then performing a full join with the original data.
# Collapse into original number of rows
data.collapsed <- data.tidy %>%
  group_by(id) %>%
  summarise(
    event_Work_from_home = max(event_Work_from_home),
    event_Post_Work_Commitment = max(event_Post_Work_Commitment),
    event_Pre_Work_Commitment = max(event_Pre_Work_Commitment),
    event_16_00_lecture = max(event_16_00_lecture),
    event_14_00_lecture = max(event_14_00_lecture),
    event_11_00_lecture = max(event_11_00_lecture),
    event_Annual_leave = max(event_Annual_leave),
    event_Sick_leave = max(event_Sick_leave),
    event_Study_leave = max(event_Study_leave),
    event_Public_Holiday = max(event_Public_Holiday),
    event_Conference = max(event_Conference),
    .groups = "drop"
  )

data.collapsed %>% head
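The rejoin itself is a single full_join on the shared id column, as in the notebook source; sketched here on toy frames:

```r
library(tidyverse)

# Toy stand-ins: collapsed indicators and the original per-day data
data.collapsed <- tibble(id = 1:2, event_16_00_lecture = c(TRUE, FALSE))
data           <- tibble(id = 1:2, clock.in = c("09:02", "09:22"))

# Restore the columns dropped during summarise() by joining on id
# (the notebook omits `by` and lets full_join match on the common column)
data.combined <- full_join(data.collapsed, data, by = "id")
```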
Compile summaries of data
With the data now in a tidy format, we are ready to proceed to plotting and analysing. For the first analysis we will look at clock in and out times per event. To do this we need to pivot the data longer, such that there is one clock in/out event per row, which makes the analysis easier.
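The pivot from the notebook source uses the older gather() (pivot_longer() would be the modern equivalent), then splits multi-event strings and unnests them; demonstrated on a toy data.combined:

```r
library(tidyverse)

# Toy stand-in for data.combined
data.combined <- tibble(
  date      = as.Date(c("2024-01-08", "2024-01-09")),
  clock.in  = c("09:02", "09:22"),
  clock.out = c("16:49", "16:36"),
  event     = c("Standard", "Pre_Work_Commitment,_16_00_lecture")
)

# One clock event per row; multi-event days are split and unnested
data.long <- data.combined %>%
  gather(key = "event.type", value = "time", clock.in, clock.out) %>%
  mutate(event.type = ifelse(event.type == "clock.in", "Clock In", "Clock Out"),
         event = str_split(event, ",_")) %>%
  unnest(event)
```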
# Get per event summaries
time.summaries <- data.long %>%
  group_by(event, event.type) %>%
  summarise(
    n = n(),
    mean.time = format(mean(time, na.rm = TRUE), "%H:%M")
  )
`summarise()` has grouped output by 'event'. You can override using the
`.groups` argument.
datatable(time.summaries)
From this we can see that the times largely correspond with the event as expected. We will move on to look at the overall clock in/out time stats.
To do this, we could have used base R's built-in summary function; however, when working with time fields it formats them in a less than ideal way, as alluded to before. To remedy this, we build a custom summary function that calculates the statistics, then tabulates and formats them before returning the table.
Overall time stats
time.summary <- function(time.field) {
  # Ensure the input is in POSIXct format
  time.field <- as.POSIXct(time.field, format = "%H:%M")
  
  # Sort times
  sorted.times <- sort(time.field)
  
  # Calculate minimum and maximum
  min.time <- min(time.field, na.rm = TRUE)
  max.time <- max(time.field, na.rm = TRUE)
  
  # Calculate median
  median.time <- median(time.field, na.rm = TRUE)
  
  # Calculate mode
  mode.time <- as.POSIXct(names(sort(table(sorted.times), decreasing = TRUE)[1]),
                          origin = "0000-01-01", tz = "UTC")
  
  # Calculate mean
  mean.time <- mean(time.field, na.rm = TRUE)
  
  # Compile output
  formatted.summary <- tibble(
    statistic = c("Min", "Max", "Median", "Mode", "Mean"),
    time = c(format(min.time, "%H:%M"),
             format(max.time, "%H:%M"),
             format(median.time, "%H:%M"),
             format(mode.time, "%H:%M"),
             format(mean.time, "%H:%M"))
  )
  return(formatted.summary)
}
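The only statistic base R lacks a built-in for is the mode; the function derives it by tabulating the times and taking the most frequent value's name, roughly:

```r
# Mode of a vector: tabulate, sort by frequency, take the top name
times <- c("09:02", "09:19", "09:19", "16:49")
mode.value <- names(sort(table(times), decreasing = TRUE)[1])
mode.value  # "09:19"
```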
We extract some useful insights from the clock in/out summaries. We can then, using the mean clock in/out times, determine how much time I spend in the office on average.
# Mean time spent at office
mean.clock.in <- parse_date_time(clock.in.summary$time[5], orders = "%H:%M")
mean.clock.out <- parse_date_time(clock.out.summary$time[5], orders = "%H:%M")
mean.office.time <- mean.clock.out - mean.clock.in
mean.office.time
Time difference of 6.133333 hours
Plot some graphs
Finally we can visualise the data. We start with a line graph showing the clock in/out times over the year. This visualisation is useful for seeing day-to-day trends, but it is less effective because of the breaks on days when I didn't go into the office.
# Line graph of clock in and out times
ggplot(data.combined, aes(x = date)) +
  geom_line(aes(y = clock.in, color = "Clock In"), linewidth = 1, na.rm = TRUE) +
  geom_line(aes(y = clock.out, color = "Clock Out"), linewidth = 1, na.rm = TRUE) +
  scale_color_manual(values = c("Clock In" = "blue", "Clock Out" = "red")) +
  labs(title = "Clock in and Clock Out Times for 2024",
       x = "Date",
       y = "Time",
       color = "Legend") +
  theme_minimal() +
  theme(legend.position = "bottom", plot.title = element_text(hjust = 0.5))
A scatter plot works better in this scenario: it still lets you see the general trend but does not look disjointed on the days without data.
# Scatter plot of clock in and out times
ggplot(data.combined, aes(x = date)) +
  geom_point(aes(y = clock.in, color = "Clock In"), na.rm = TRUE) +
  geom_point(aes(y = clock.out, color = "Clock Out"), na.rm = TRUE) +
  scale_color_manual(values = c("Clock In" = "blue", "Clock Out" = "red")) +
  labs(title = "Clock in and Clock Out Times for 2024",
       x = "Date",
       y = "Time",
       color = "Legend") +
  theme_minimal() +
  theme(legend.position = "bottom", plot.title = element_text(hjust = 0.5))
However, we can fit a lot more data onto this plot. In the next plot we draw the same scatter plot as before, but we differentiate between clock in and clock out points using shape instead of colour. This frees us up to use colour for the different types of event. We then also add the mean clock in/out times as horizontal lines.
We use geom_jitter to disperse points so it is visible when there are multiple events on one day. We also use transparency to help show when this is the case.
# Plot more detailed plot
plot.colours <- c("Post_Work_Commitment" = "green", "Pre_Work_Commitment" = "grey",
                  "16_00_lecture" = "blue", "14_00_lecture" = "red",
                  "11_00_lecture" = "black", "Study_leave" = "yellow",
                  "Conference" = "purple", "Standard" = "orange")
plot.labels <- c("Work_from_home" = "Work from home",
                 "Post_Work_Commitment" = "Post-work commitment",
                 "Pre_Work_Commitment" = "Pre-work commitment",
                 "16_00_lecture" = "16:00 lecture",
                 "14_00_lecture" = "14:00 lecture",
                 "Public_Holiday" = "Public holiday",
                 "Annual_leave" = "Annual leave",
                 "11_00_lecture" = "11:00 lecture",
                 "Study_leave" = "Study leave",
                 "Conference" = "Conference",
                 "Sick_leave" = "Sick leave",
                 "Standard" = "Standard")
set.seed(2025)
clock.in.out.detail <- ggplot(data.long,
                              aes(x = date, y = time, colour = event,
                                  fill = event, shape = event.type)) +
  geom_jitter(size = 3, width = 250, height = 250, alpha = 0.7, na.rm = TRUE) +
  scale_color_manual(values = plot.colours, labels = plot.labels) +
  scale_fill_manual(values = plot.colours, labels = plot.labels) +
  scale_shape_manual(values = c("Clock In" = 21, "Clock Out" = 25)) +
  labs(title = "Clock in and Clock Out Times for 2024",
       x = "Date",
       y = "Time",
       color = "Event",
       fill = "Event",
       shape = "Event Type") +
  theme_minimal() +
  theme(legend.position = "bottom", plot.title = element_text(hjust = 0.5))
# Add mean times to plot
clock.in.out.detail.static <- clock.in.out.detail +
  geom_hline(yintercept = mean.clock.in, color = "blue", linetype = "dashed") +
  geom_hline(yintercept = mean.clock.out, color = "red", linetype = "dashed") +
  geom_text(aes(x = as.POSIXct("2024-01-01"), y = mean.clock.in,
                label = format(mean.clock.in, "%H:%M")),
            color = "blue", vjust = -0.5, hjust = 1.2) +
  geom_text(aes(x = as.POSIXct("2024-01-01"), y = mean.clock.out,
                label = format(mean.clock.out, "%H:%M")),
            color = "red", vjust = -0.5, hjust = 1.2)

clock.in.out.detail.static
# Save plot
# ggsave("Clock In Out Chart.png", clock.in.out.detail.static, width = 2560, height = 1600, units = "px", bg = "white", scale = 1.5)
Finally, an interactive version of the above plot using Plotly.
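The interactive chunk in the notebook source is a single ggplotly() call on the detailed plot; a minimal self-contained version with a stand-in plot:

```r
library(ggplot2)
library(plotly)

# Any ggplot object can be wrapped; the notebook calls
# ggplotly(clock.in.out.detail, tooltip = c("x", "y", "colour", "shape"))
g <- ggplot(data.frame(x = 1:3, y = 3:1), aes(x, y)) + geom_point()
p <- ggplotly(g, tooltip = c("x", "y"))
```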
2. R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
3. Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10).